Load the data

The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), ā€˜patterned’ precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35).

Schematic depicting the iPSC differentiation strategy for motor neurogenesis. Arrows indicate sampling time-points in days when cells were fractionated into nuclear and cytoplasmic compartments prior to deep (polyA) RNA-sequencing. Four iPSC clones were obtained from four different healthy controls and three iPSC clones from two ALS patients with VCP mutations: R155C and R191Q; hereafter termed VCPmu. Induced-pluripotent stem cells (iPSC); neural precursors (NPC); ā€œpatternedā€ precursor motor neurons (ventral spinal cord; pMN); post-mitotic but electrophysiologically inactive motor neurons (MN); electrophysiologically active MNs (mMN). The gene expression count data was obtained from Kallisto (Bray et al., 2016) using the Gencode hg38 release Homo sapiens transcriptome. The quantification of alternative splicing was performed using VAST-TOOLS (Tapial et al., 2017). The data required for this practical session can be downloaded from Zenodo (please note this is a different version of the one in the previous module).

load("./AS_GE_data.RData")

#Data: 
# info                    : data-frame of information for the nuclear samples
# myE_gen                 : noramlised gene expression count matrix of the CTRL nuclear fraction quantile-normalised
# dat_diff_time_nuc_ctrl  : data-frame with different AS analysis for the nuclear compartment
# dat_diff_time_cyto_ctrl : data-frame with different AS analysis for the cytoplasmic compartment
# tab_psi                 : data frame with all events in VAST-TOOLS and their PSI values in each sample 
# Percentage or proportion spliced-in (PSI or ĪØ)

#Here I make some nice colors to facilitate the visualisation
mygroup                <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days            <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days)     <- c(0,3,7,14,22,35)
mycols                 <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))

Some useful functions for your analysis

#Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
  gostres_diff <- gost(query = genes_list, 
                  organism = "hsapiens", ordered_query = FALSE, 
                  multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, 
                  measure_underrepresentation = FALSE, evcodes = TRUE, 
                  user_threshold = 0.05, correction_method = "g_SCS", 
                  domain_scope = "annotated", custom_bg = NULL, 
                  numeric_ns = "", as_short_link = FALSE,sources=c("GO:BP", "GO:MF","KEGG"))
  gostplot(gostres_diff, capped = FALSE, interactive = TRUE)#please note this is going to create an interactive plot
}

#Plot PSI events over time in nuclear cyto fractions
plotEventOverTime <- function(event="HsaINT0002504"){
  toi     <- c("E.dPsi.d_3_0","E.dPsi.d_7_0","E.dPsi.d_14_0","E.dPsi.d_22_0","E.dPsi.d_35_0")
  dat_nuc <- dat_diff_time_nuc_ctrl[match(event,dat_diff_time_nuc_ctrl$EVENT),match(toi,colnames(dat_diff_time_nuc_ctrl))]
  dat_cyto<- dat_diff_time_cyto_ctrl[match(event,dat_diff_time_cyto_ctrl$EVENT),match(toi,colnames(dat_diff_time_cyto_ctrl))]
  tempdat <- rbind(as.numeric(dat_nuc),as.numeric(dat_cyto))
  rownames(tempdat)<-c("nuc","cyto")
  colnames(tempdat)<- c("3","7","14","22","35")
  tempdat[is.na(tempdat)]<-0
  barplot(tempdat,beside=TRUE,las=1,ylab="dPSI",xlab="DIV",main=paste(tab_psi$GENE[match(event,tab_psi$EVENT)], "-", tab_psi$COORD[match(event,tab_psi$EVENT)]),col=c("#CC9C3D","#476DB4"))
  legend("topleft",ncol=1,pch=15,col=c("#CC9C3D","#476DB4"),leg=c("nuc","cyto"),cex=0.7)
  grid()
}

#To following function sort myvec in decreasing order and return their indexes:
myidx<- sort(x=c(1:2000),index.return=TRUE,decreasing=TRUE)$ix

Alternative splicing analysis

Let’s first analyse the relative number of alternative splicing (AS) events exhibiting significant changes between the iPSC stage (DIV=0) and the NPC stage (DIV=14), and compare those between nuclear (dat_diff_time_nuc_ctrl data-table) and cytoplasmic fractions (dat_diff_time_cyto_ctrl). The schematic below shows the four main types of events we are going to look at.
Schematic depicting the four different types of alternative splicing events we are going to study. Alternative 5’ start site (Alt5); Alternative exon (AltEx); Alternative 3’ end side (Alt3); intron retention (IR)
Schematic depicting the four different types of alternative splicing events we are going to study. Alternative 5’ start site (Alt5); Alternative exon (AltEx); Alternative 3’ end side (Alt3); intron retention (IR)

Here is a summary of the column content which you can find at on Github repo of VAST-TOOL:

Get familiar with your data

First you always need to make sure you understand the format of your data, the content of the columns and the rows. Start by checking the dimensions of the data using dim, nrow, ncol functions. Then if the data-count table is not too large, you can use the View function to visualise and explore its content. Alternatively you can print a couple of rows. Here you need to understand the output from VAST-TOOLS.

dim(tab_psi)
## [1] 365711    102
print(colnames(tab_psi))
##   [1] "GENE"          "EVENT"         "COORD"         "LENGTH"       
##   [5] "FullCO"        "COMPLEX"       "0_CB1D_cyto"   "0_CB1D_nuc"   
##   [9] "0_CB1E_cyto"   "0_CB1E_nuc"    "0_CTRL1_cyto"  "0_CTRL1_nuc"  
##  [13] "0_CTRL3_cyto"  "0_CTRL3_nuc"   "0_CTRL4_cyto"  "0_CTRL4_nuc"  
##  [17] "0_CTRL5_cyto"  "0_CTRL5_nuc"   "0_GliA_cyto"   "0_GliA_nuc"   
##  [21] "0_GliB_cyto"   "14_CB1D_cyto"  "14_CB1D_nuc"   "14_CB1E_cyto" 
##  [25] "14_CB1E_nuc"   "14_CTRL1_cyto" "14_CTRL1_nuc"  "14_CTRL3_cyto"
##  [29] "14_CTRL3_nuc"  "14_CTRL4_cyto" "14_CTRL4_nuc"  "14_CTRL5_cyto"
##  [33] "14_CTRL5_nuc"  "14_GliA_cyto"  "14_GliA_nuc"   "14_GliB_cyto" 
##  [37] "14_GliB_nuc"   "22_CB1D_cyto"  "22_CB1D_nuc"   "22_CB1E_cyto" 
##  [41] "22_CB1E_nuc"   "22_CTRL1_cyto" "22_CTRL1_nuc"  "22_CTRL3_cyto"
##  [45] "22_CTRL3_nuc"  "22_CTRL4_cyto" "22_CTRL4_nuc"  "22_CTRL5_cyto"
##  [49] "22_CTRL5_nuc"  "22_GliA_cyto"  "22_GliA_nuc"   "22_GliB_cyto" 
##  [53] "22_GliB_nuc"   "35_CB1D_cyto"  "35_CB1D_nuc"   "35_CB1E_cyto" 
##  [57] "35_CB1E_nuc"   "35_CTRL1_cyto" "35_CTRL1_nuc"  "35_CTRL3_cyto"
##  [61] "35_CTRL3_nuc"  "35_CTRL4_cyto" "35_CTRL4_nuc"  "35_CTRL5_cyto"
##  [65] "35_CTRL5_nuc"  "35_GliA_cyto"  "35_GliA_nuc"   "35_GliB_cyto" 
##  [69] "35_GliB_nuc"   "3_CB1D_cyto"   "3_CB1D_nuc"    "3_CB1E_cyto"  
##  [73] "3_CB1E_nuc"    "3_CTRL1_cyto"  "3_CTRL1_nuc"   "3_CTRL3_cyto" 
##  [77] "3_CTRL3_nuc"   "3_CTRL4_cyto"  "3_CTRL4_nuc"   "3_CTRL5_cyto" 
##  [81] "3_CTRL5_nuc"   "3_GliA_cyto"   "3_GliA_nuc"    "3_GliB_cyto"  
##  [85] "3_GliB_nuc"    "7_CB1D_cyto"   "7_CB1D_nuc"    "7_CB1E_cyto"  
##  [89] "7_CB1E_nuc"    "7_CTRL1_cyto"  "7_CTRL1_nuc"   "7_CTRL3_cyto" 
##  [93] "7_CTRL3_nuc"   "7_CTRL4_cyto"  "7_CTRL4_nuc"   "7_CTRL5_cyto" 
##  [97] "7_CTRL5_nuc"   "7_GliA_cyto"   "7_GliA_nuc"    "7_GliB_cyto"  
## [101] "7_GliB_nuc"    "COMPLEX_short"
dim(dat_diff_time_nuc_ctrl)
## [1] 365711     21
print(colnames(dat_diff_time_cyto_ctrl))
##  [1] "GENE"            "EVENT"           "COORD"           "LENGTH"         
##  [5] "FullCO"          "COMPLEX"         "E.dPsi.d_3_0"    "MV.dPsi.d_3_0"  
##  [9] "E.dPsi.d_7_0"    "MV.dPsi.d_7_0"   "E.dPsi.d_7_3"    "MV.dPsi.d_7_3"  
## [13] "E.dPsi.d_14_0"   "MV.dPsi.d_14_0"  "E.dPsi.d_22_0"   "MV.dPsi.d_22_0" 
## [17] "E.dPsi.d_35_0"   "MV.dPsi.d_35_0"  "E.dPsi.d_35_22"  "MV.dPsi.d_35_22"
## [21] "COMPLEX_short"
dim(dat_diff_time_cyto_ctrl)
## [1] 365711     21
print(colnames(dat_diff_time_nuc_ctrl))
##  [1] "GENE"            "EVENT"           "COORD"           "LENGTH"         
##  [5] "FullCO"          "COMPLEX"         "E.dPsi.d_3_0"    "MV.dPsi.d_3_0"  
##  [9] "E.dPsi.d_7_0"    "MV.dPsi.d_7_0"   "E.dPsi.d_7_3"    "MV.dPsi.d_7_3"  
## [13] "E.dPsi.d_14_0"   "MV.dPsi.d_14_0"  "E.dPsi.d_22_0"   "MV.dPsi.d_22_0" 
## [17] "E.dPsi.d_35_0"   "MV.dPsi.d_35_0"  "E.dPsi.d_35_22"  "MV.dPsi.d_35_22"
## [21] "COMPLEX_short"

Overview of the relative number of changes of specific types between nucleus and cytoplasm

It is generally good to get an understanding of the type of events which are differentially spliced between two conditions. Pie-charts can be useful to visualise the relative abundances of events in each categories between different types of conditions. \(\Delta\)PSI>0 indicates inclusion while \(\Delta\)PSI<0 indicates skipping between two conditions. The default threshold of a difference in PSI is 15%.

Let’s start by looking at the relative number of events with a threshold difference of 20% between the iPSC stage (DIV=0) and the MN stage (DIV=35) in the nucleus:

col_events=c("#CD4599","#8C8C8C","#6ABD45","#4B5493","#FEC20F")
names(col_events) <- names(table(dat_diff_time_nuc_ctrl$COMPLEX_short))
par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="NUCLEUS -- DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)

And compare these with event distribution in cytoplasmic fractions:

par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="CYTOPLASM --DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)

What are the key observations?

Comparisons of the AS changes detected in nucleus and cytoplasm over time

Having found some differences at MN stages in relative distributions of included events between cyto and nuc (enrichment in nuclear fraction of IR), let’s now test whether splicing patterns are similar in nucleus versus cytoplasm over time.

pcc_events<- do.call(what=cbind,args=lapply(unique(dat_diff_time_cyto_ctrl$COMPLEX_short),function(EV)return(unlist(lapply(c("E.dPsi.d_3_0",   "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0"),function(coln)return(cor(dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_nuc_ctrl))],dat_diff_time_cyto_ctrl[dat_diff_time_cyto_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_cyto_ctrl))],use = "complete")))))))
colnames(pcc_events)<- unique(dat_diff_time_cyto_ctrl$COMPLEX_short)
rownames(pcc_events)<-c("E.dPsi.d_3_0",   "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0") 

par(mfrow=c(1,1))
barplot(pcc_events,beside=TRUE,col=unlist(lapply(colnames(pcc_events),function(Z)return(rep(col_events[match(Z,names(col_events))],5)))),las=1,ylim=c(0,1),ylab="Pearson Correlation Coefficient")

Let’s now visualise these results, in particular the changes between NPC and MN stages in terms of alternative exon usage and intron retention:

par(mfrow=c(2,2),mar=c(4,4,4,4))
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
grid()
abline(0,1,col="red",lty=2)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)

plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)

plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)

plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)

If interested have a look at the role of nuclear detention of intron-retaining transcripts.

Singular value decomposition (SVD)

While hierarchical clustering might be dominated by some changes in few genes, other methods such as singular value decomposition (SVD) analysis can be instrumental in identifying independent biological pathways underlying changes in gene expression. It is also very appropriate for time-series analysis when variation is not linearly dependent on time.

Relevant functions for SVD analysis

#This function extract the fraction of variance per component as derived from SVD analysis
#Input= SVD object 
getFractionVariance<- function(mySVD){
  return(mySVD$d*mySVD$d/sum(mySVD$d*mySVD$d))
}

#This function calculate the Shannon Entropy which indicates how well the information is spread among the principal components.
#High value indicates information evenly distributed among components
#Low value indicates most information/variance is captured by a single component
#Takes as input the fraction of variance outputted from getFractionVariance
getShannonEntropy <- function(pip)return(-1*sum(pip*log10(pip))/log10(length(pip)))

#This function plots the fraction of variance captured by first N components.
#pi=fraction of variance as obtained from getFractionVariance
#dp=Shannon Entropy as obtained from getShannonEntropy
#N=number of components to plot
ScrePlot <- function(pi,dp,N=NA){
  if(is.na(N)){
  barplot(pi,las=1,cex.main=0.7,cex.axis=1.0,col="black")
  }
  else{
    barplot(pi[c(1:N)],las=1,cex.main=0.7,cex.axis=1.0,col="black")
  }
  mtext(side = 1, line = 2, text ="principal components", col = "black",cex=0.7, font=1)
  mtext(side = 2, line = 2, text ="Fraction of explained variance", col = "black",cex=0.7, font=1)
  mtext(side = 3, line = -2, text = paste("Shannon Entropy = ",round(dp,digits=3)), col = "black",cex=0.7, font=1)
}

Prepare relevant matrices

For this analysis we will focus on the nuclear samples and use the different types of data collected so far i.e.Ā gene expression, PUD, IR and AltEx. A similar analysis can be done on the cytoplasmic fraction.

ge                <- myE_gen
EX_nuc            <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="EX"),match(colnames(myE_gen),colnames(tab_psi))])
IR_nuc            <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="IR"),match(colnames(myE_gen),colnames(tab_psi))])

rownames(EX_nuc)  <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="EX")]
rownames(IR_nuc)  <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="IR")]

#You need to remove NA's and NaN for the unsupervised analysis
tosel_ex          <- which(apply(is.na(EX_nuc),1,sum)==0&apply(is.nan(EX_nuc),1,sum)==0)
tosel_ir          <- which(apply(is.na(IR_nuc),1,sum)==0&apply(is.nan(IR_nuc),1,sum)==0)
EX_nuc            <- EX_nuc[tosel_ex,]
IR_nuc            <- IR_nuc[tosel_ir,]

Systematic removal of the first component: effect on GE matrix

It is common in SVD analysis to remove the first component which captures noise == most variations from from systematic changes in basal gene expression between genes.

#SVD on the Gene Expression
dat1         <- ge
SVD_eset     <- svd(dat1)
pi_1         <- getFractionVariance(mySVD=SVD_eset)
dp_1         <- getShannonEntropy(pi_1)


datm         <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE  <- svd(datm)
pi_2         <- getFractionVariance(mySVD=SVD_eset_GE)
dp_2         <- getShannonEntropy(pi_2)


par(mfrow=c(2,4))
ScrePlot(pi_1,dp_1,N=NA)
mtext(side=3,line=0,text="prior removing first component")
#Effect on the sample distribution
multidensity(dat1[,c(1,10,15,20)],main="prior filtering",las=1)
boxplot(dat1,outline=FALSE)
boxplot(t(dat1[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")

ScrePlot(pi_2,dp_2,N=NA)
mtext(side=3,line=0,text="after removing first component")
multidensity(datm[,c(1,10,15,20)],main="after filtering",las=1)
boxplot(datm,outline=FALSE)
#Effect on the genes distribution
boxplot(t(datm[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")

Perform SVD on individual matrices (AltEx, IR, and GE)

We start by performing SVD on individual matrices and compare how information is captured in the different matrices.

layout(matrix(c(1:6),ncol=3,nrow=2,byrow=FALSE))

#1. SVD on the Gene Expression
dat1         <- ge
SVD_eset     <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="Gene Expression")
datm         <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE  <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_GE),getShannonEntropy(getFractionVariance(SVD_eset_GE)))
mtext(side=3,line=0,text="Gene Expression")
#2. SVD on the AltEx
dat1         <- EX_nuc
SVD_eset     <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="AltEx")
mtext(side=3,line=1,text="Prior removing first component")
datm         <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_Ex  <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_Ex),getShannonEntropy(getFractionVariance(SVD_eset_Ex)))
mtext(side=3,line=1,text="After removing first component")
mtext(side=3,line=0,text="AltEx")
#3. SVD on the IR
dat1         <- IR_nuc
SVD_eset     <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="IR")
datm         <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_IR  <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_IR),getShannonEntropy(getFractionVariance(SVD_eset_IR)))
mtext(side=3,line=0,text="IR")

Let’s now compare their first PCs to test who the time is captured by the different matrices in the different SVD.

#PCA plots
par(mfrow=c(3,3),mar=c(4,4,2,2))
plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="IR",pch=19,cex=1.5,las=1)


plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)


plot(SVD_eset_GE$v[,2], SVD_eset_GE$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,2], SVD_eset_Ex$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,2], SVD_eset_IR$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)

Do you see any differences in the clustering of the samples using the different count matrices?

The dynamic of the components over time

Let’s now look into the dynamic over time of the different component:

myMean <- list( t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
                t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
                t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean))))               )
mySD  <- list(t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
              t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
              t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd))))              )

#Plot component over time
days                <- c(0,3,7,14,21,35)
cats                <- c("Gene Expression","AltEx","IR","3UTR")
CEX<- 0.7
par(mfrow=c(3,3),mar=c(5,5,2,0),oma=c(2,2,2,2))
for(j in c(1:3)){
  for(i in c(1:3)){
    MIN=min(0,min(myMean[[j]][i,]-mySD[[j]][i,]))
    MAX=max(myMean[[j]][i,]+mySD[[j]][i,])
    plot(days,myMean[[j]][i,],pch=19,type="b",lty=2,ylim=c(MIN,MAX),las=1,frame="F",xlab="time [days]",cex=1.0,cex.axis=CEX,cex.lab=CEX,ylab="")
    mtext(side=2,line=3,text=paste("PC",i,sep=""),cex=CEX)
    mtext(side=3,line=0,text=cats[j],cex=CEX)
    grid()
    abline(h=0,col="red",lty=2)
  }
}

Graded homework

Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.

Taks 2. Hierarchical clustering analysis (0.25 pt)

  • Use hierarchical clustering analysis to test whether the time component is differentially captured by changes in gene expression, AltEx, and intron retention the nuclear fractions.
# We use the Deseq2 method for all samples
CEX <- 1
hc_ge <- hclust(dist(t(ge),method="man"), method = "ward.D", members = NULL)
hc_EX_nuc <- hclust(dist(t(EX_nuc),method="man"), method = "ward.D", members = NULL)
hc_IR_nuc <- hclust(dist(t(IR_nuc),method="man"), method = "ward.D", members = NULL)

par(mfrow=c(1,3))
par(xpd=NA,oma=c(0,0,2,0))

plot(ape::as.phylo(hc_ge),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="Gene expression")
plot(ape::as.phylo(hc_EX_nuc),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="AltEx")
plot(ape::as.phylo(hc_IR_nuc),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="IR")

In all these different hierarchical clustering it is possible to see that the samples from the same DIV are very similar and they are the first ones to be merged.

Considering gene expression plot, the most similar are DIV 3 and 7. The second most similar are DIV 22 and 35. Then DIV 0 is merged with DIV {3, 7} and then DIV 14 is merged with DIV {0, 3, 7}. The last two clusters to be merged are DIV {0, 3, 7, 14} and DIV {22, 35} and these two are very far compared to all the other merges.

Considering alternative exon, we still have that the last two clusters to be merged are DIV {0, 3, 7, 14} and DIV {22, 35}, which are quite far apart. What changes is that DIV 14 is merged with cluster {0, 3} before 0 (while in gene expression DIV 0 was merged with {0, 3} before DIV 14).

Considering intro retention, we still have that the last two clusters to be merged are DIV {0, 3, 7, 14} and DIV {22, 35}, which are quite far apart. In this case, DIV 7 and 14 are marged at first, then DIV 3 and at the end DIV 0.

Task 3. SVD analysis (0.25 pt)

  • Extract the top 500 most positively and 500 most negatively contributing genes to the first 5 components of SVD from IR .
  • Use boxplot to display their behaviours over time.
# Use left singular matrix of the SVD for this task
SVD_eset <- svd(IR_nuc)

# Visualise the impact of the different PC components on the fraction of explained variance
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="IR before noise removal")

The main source of variance is given by the noise, so we remove the first component.

# Remove the first component
datm <- IR_nuc-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))

# Visualise the impact of the different PC components on the fraction of explained variance
SVD_eset_after  <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_after),getShannonEntropy(getFractionVariance(SVD_eset_after)))
mtext(side=3,line=0,text="IR after noise removal")

# For the first 5 PC components, we need to sort the vectors to obtain the 500 most positively and negatively contributing genes. The PC components in SVD_eset are already sorted.

indeces <- c(1,2,3,4,5)
colors <- rep(c("blue", "red", "green", "orange", "purple", "yellow"), each = 4)
times <- rep(c("DIV0", "DIV3", "DIV7", "DIV14", "DIV22", "DIV35"), each = 4)

first_500 <- lapply(indeces, function(index) {
  # Extract the scores
  PC <- SVD_eset_after$u[,index]
  
  # Get the ordered indices
  sorted_indices <- sort(x=PC, index.return=TRUE, decreasing=TRUE)$ix
  
  # Get the indices of the top positive and top negative
  top_positively_indices <- sorted_indices[1:500]
  top_negatively_indices <- tail(sorted_indices, 500)
  
  # Get the rows from IR_nuc corresponding to the top positive and top negative genes
  most_positive_nuc <- IR_nuc[top_positively_indices,]
  most_negative_nuc <- IR_nuc[top_negatively_indices,]
  
  par(mfrow=c(2, 1))
  # Plot the most positive and most negative genes
  boxplot(most_positive_nuc, main = paste0("Top 500 most positively contributing genes, PC component: ", index), ylab = "PSI", xlab = "Time points", col = colors, las = 2, names=times)
  
  boxplot(most_negative_nuc, main = paste0("Top 500 most negatively contributing genes, PC component: ", index), ylab = "PSI", xlab = "Time points", col = colors, las = 2, names=times)

  most_positive_nuc_events <- rownames(most_positive_nuc)
  most_negative_nuc_events <- rownames(most_negative_nuc)
  
  # Get the name of the genes, for the GO analysis
  most_positive_nuc_genes <- dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% most_positive_nuc_events, ]$GENE
  most_negative_nuc_genes <- dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% most_negative_nuc_events, ]$GENE
  list(positive = most_positive_nuc_genes, negative = most_negative_nuc_genes)
})

Comment: The first component is the one that primarily considers time. As time progresses, the mean value increases for the negatively contributing genes, and it decreases for positive contributing genes.

Task 4 (Bonus). Enrichment of relevant PCi (0.25 pt)

Perform BP enrichment analysis on relevant PCs as identified in previous point.

GO analysis on the first component

GO_analysis(first_500[[1]]$positive)

The 500 genes most positively contributing to the principal component 1 are related to transport and localization.

GO_analysis(first_500[[1]]$negative)

The 500 genes most negatively contributing to the principal component 1 are related to organization processes.

GO analysis on the second component

GO_analysis(first_500[[2]]$positive)

The 500 genes most positively contributing to the principal component 2 are related to organelle organization, transport and localization.

GO_analysis(first_500[[2]]$negative)

The 500 genes most negatively contributing to the principal component 2 are related to organelle organization, metabolic processes and regularization processes.

GO analysis on the third component

GO_analysis(first_500[[3]]$positive)

The 500 genes most positively contributing to the principal component 3 are related to developmental processes.

GO_analysis(first_500[[3]]$negative)

The 500 genes most negatively contributing to the principal component 3 are related to organelle organization, transport, vesicle-mediated transport, actin cytoskeleton organization, actin filament-based process.

GO analysis on the fourth component

GO_analysis(first_500[[4]]$positive)

The 500 genes most positively contributing to the principal component 4 are related to regulation processes, localization and neurotransmitter receptor transport.

GO_analysis(first_500[[4]]$negative)

The 500 genes most negatively contributing to the principal component 4 are related to developmental processes (nervous system development, system development, multicellular organism development) and cell morphogenesis.

GO analysis on the fifth component

GO_analysis(first_500[[5]]$positive)

The 500 genes most positively contributing to the principal component 5 are related to cellular response to stress, regulation processes and developmental processes

GO_analysis(first_500[[5]]$negative)

The 500 genes most negatively contributing to the principal component 5 are related to regulation processes, modification processes, organic cyclic compound biosynthetic process.